1. Introduction


Generative Adversarial Networks (GANs) are capable of generating high-quality images; however, the resolution of generated images remains relatively small. There have been many efforts to address this issue. For example, ProGAN trains high-resolution GANs in the single-class setting by iteratively training across a set of increasing resolutions. Nevertheless, model training remains unstable despite the large number of studies that have investigated it and proposed improvements. Without auxiliary stabilization techniques, the training procedure is notoriously brittle, requiring finely tuned hyperparameters and architectural choices to work at all. Most of the improvements have come from changes to the objective function or from constraining the discriminator during training. More recently, scaling up GAN models has been found to work well for generating both higher-quality and larger images.


The BigGAN authors provide class information to the generator with class-conditional BatchNorm, as seen in Figure 15 (sub-figures (a) and (b)), and to the discriminator with projection. They also use Orthogonal Initialization instead of the classic Xavier Initialization or N(0, 0.02I). BatchNorm statistics in G are computed across all devices instead of per device, which is the typical scenario. They note that progressive growing, as in ProGAN, is unnecessary. Simply increasing the batch size by a factor of 8 improved performance, in terms of Inception Score (IS), by 46%. They explain that a larger batch provides better gradients for both networks. It also let them reach a better final performance in fewer iterations. They then increase the number of channels in each convolutional layer by 50% (which roughly doubles the number of parameters), resulting in a 21% improvement in IS. Notice from Figure 15 that the class embeddings are shared, and that separate linear layers project them to fit each BatchNorm layer. This reduces computation cost considerably and improves training speed by 37%. Notice also that the noise vector z is split into one chunk per ResBlock and concatenated with the class embedding c. This gave a slight improvement of 4%. For readers wondering what a non-local block is, its diagram is described in Figure 2 (the spacetime non-local block).
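The shared-embedding, chunked-noise conditioning described in this paragraph can be sketched in plain Python. This is only a minimal illustration under stated assumptions: the helper names (`split_chunks`, `linear`, `conditional_scale_shift`), the toy sizes, and the per-sample, per-channel simplification are hypothetical and are not taken from the BigGAN code. The sketch assumes one extra chunk for the generator's first layer and treats the already-normalized activations as a flat list of channel values.

```python
import random

def split_chunks(z, num_chunks):
    """Split a flat noise vector into equal-size chunks."""
    if len(z) % num_chunks != 0:
        raise ValueError("len(z) must be divisible by num_chunks")
    size = len(z) // num_chunks
    return [z[i * size:(i + 1) * size] for i in range(num_chunks)]

def linear(weights, vec):
    """Bias-free linear layer: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def conditional_scale_shift(x_normalized, cond, gain_w, bias_w):
    """Scale and shift normalized activations with projections of cond.

    The gain projection is centred at 1, so 1 is added to its output;
    the bias projection is centred at 0.
    """
    gains = [1.0 + g for g in linear(gain_w, cond)]
    biases = linear(bias_w, cond)
    return [g * x + b for g, x, b in zip(gains, x_normalized, biases)]

rng = random.Random(0)
z = [rng.gauss(0.0, 1.0) for _ in range(8)]   # noise vector (toy size)
class_embedding = [0.5, -0.5]                  # shared class embedding c (toy values)
num_blocks = 3                                 # number of ResBlocks (assumed)

# One chunk for the first layer, then one chunk per ResBlock.
chunks = split_chunks(z, num_blocks + 1)
first_chunk, block_chunks = chunks[0], chunks[1:]

# Each ResBlock is conditioned on the shared embedding plus its own chunk.
conds = [class_embedding + chunk for chunk in block_chunks]

# Separate (here zero-initialized) projections for one BatchNorm layer.
gain_w = [[0.0] * len(conds[0]) for _ in range(2)]
bias_w = [[0.0] * len(conds[0]) for _ in range(2)]
out = conditional_scale_shift([1.0, -2.0], conds[0], gain_w, bias_w)
```

With zero projection weights the gain is exactly 1 and the bias exactly 0, so the layer starts out as an identity on the normalized input; that is the practical reason for centring the gain at 1.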


2. Related Work 


In earlier I2I works [24], researchers used many aligned image pairs from the source and target domains to obtain a model that translates source images into the desired target images.

Unsupervised I2I. Training supervised translation is not very practical because of the difficulty and high cost of acquiring such large paired training sets in many tasks. Taking photo-to-painting translation as an example (e.g., f. in Fig.), it is almost impossible to collect massive amounts of labeled paintings that match the input landscapes. Hence, unsupervised methods [76, 27, 63] have gradually attracted more attention. In an unsupervised learning setting, I2I methods use two large but unpaired sets of training images to convert images between representations.

Semi-supervised I2I. In some special scenarios, such as old movie restoration [43] or genomics [52], we still need a small amount of expensive human labeling or expert guidance alongside abundant unlabeled data. Therefore, researchers have considered introducing semi-supervised learning [28, 48, 5] into I2I to further improve the performance of image translation. Semi-supervised I2I approaches leverage only source images alongside a few source-target aligned image pairs for training, yet can achieve better translation results than their unsupervised counterparts.

Few-shot I2I. Nonetheless, several problems remain when translating with a supervised, unsupervised, or semi-supervised I2I method on extremely limited data. In contrast, humans can learn from only one or a few exemplars and still achieve remarkable results. As noted in meta-learning [73, 57] and few-shot learning [53, 58], humans can effectively use prior experience and knowledge when learning new tasks, while artificial learners usually overfit severely without the necessary prior knowledge. Inspired by the human learning strategy, few- and one-shot I2I algorithms [38, 34, 35, 36] have been proposed to translate from very few (or, in the limit, even one) unpaired training examples of the source and target domains.


Although learning settings may differ, most of these I2I techniques tend to learn a deterministic one-to-one mapping and generate only single-modal output (see Fig.). However, in practice, two-domain I2I is inherently ambiguous, as one input image may correspond to multiple possible outputs, namely multimodal outputs (see Fig.). Multimodal I2I translates the input image from one domain to a distribution of potential outputs in the target domain while remaining faithful to the input. These diverse outputs represent different color or style-texture themes (i.e., they are multimodal) but still preserve semantic content similar to that of the input source image. Therefore, we view multimodal I2I as a special case of two-domain I2I and discuss it in supervised and unsupervised settings (subsection).

Figure 15: (a) A typical architectural layout for BigGAN’s G; details are in the following tables. (b) A Residual Block (ResBlock up) in BigGAN’s G. (c) A Residual Block (ResBlock down) in BigGAN’s D.

Figure 1.


Most computer vision problems can be seen as image-to-image translation problems: mapping an image in one domain to a corresponding image in a different domain. For example, super-resolution can be viewed as mapping a low-resolution image to a corresponding high-resolution one, and image colorization as mapping a gray-scale image to a corresponding color one. The problem can be studied in supervised and unsupervised settings. In the supervised approaches, pairs of corresponding images in the different domains are available [24]. In the unsupervised models, only two separate sets of images are available, one consisting of images from one domain and the other of images from a different domain; there are no paired samples showing how an image could be translated into a corresponding image in the other domain. For lack of corresponding images, the unsupervised image-to-image translation problem is considered more difficult, but it is more feasible in practice because training data collection is easier.


When viewing the image translation problem from a probabilistic perspective, the main challenge is to learn a joint distribution of images in different domains. In the unsupervised setting, the two sets consist of images from the two marginal distributions of the different domains, and the task is to infer the joint distribution from these images. However, deriving the joint distribution from the marginal distributions is an extremely ill-posed problem. In this section, we discuss image-to-image translation methods. Image-to-image translation is similar to style transfer, which receives a style image and a content image as input and outputs an image that has the content of the content image and the style of the style image. Image-to-image translation, however, not only transfers image styles but also manipulates the features of objects. This section lists several models proposed for image-to-image translation, from supervised methods to unsupervised ones. The figure shows sample results generated by [24].


2.1. Supervised Translation 


Isola et al. [24] proposed combining the adversarial loss with an L1 regularization loss, so that the generator is trained not only to fool the discriminator but also to produce images that contain realistic objects and are close to the ground-truth images. L1 was chosen over L2 because it produces less blurry images. The conditional GAN loss is formulated as:

$$\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x, y \sim p_{data}(x, y)}[\log D(x, y)] + \mathbb{E}_{x \sim p_{data}(x),\, z \sim p_z(z)}[\log(1 - D(x, G(x, z)))], \tag{1}$$

in which x, y ~ p(x, y) denotes images that have different styles but belong to the same scene and, similar to the standard GAN [18], z ~ p(z) represents random noise. The L1 loss for enforcing self-similarity is defined as:

$$\mathcal{L}_{L1}(G) = \mathbb{E}_{x, y \sim p_{data}(x, y),\, z \sim p_z(z)}[\lVert y - G(x, z) \rVert_1], \tag{2}$$

and the general objective is specified by:

$$G^*, D^* = \arg\min_G \max_D \mathcal{L}_{cGAN}(G, D) + \lambda \mathcal{L}_{L1}(G), \tag{3}$$

in which the hyperparameter λ is used to balance the two loss functions.

Figure 2. A spacetime non-local block. The feature maps are shown as the shape of their tensors, e.g., T×H×W×1024 for 1024 channels (proper reshaping is performed when noted). “⊗” denotes matrix multiplication, and “⊕” denotes element-wise sum. The softmax operation is performed on each row. The blue boxes denote 1×1×1 convolutions. Here we show the embedded Gaussian version, with a bottleneck of 512 channels. The vanilla Gaussian version can be done by removing θ and φ, and the dot-product version can be done by replacing softmax with scaling by 1/N.

Figure 2: (a) The effects of increasing truncation. From left to right, the threshold is set to 2, 1, 0.5, 0.04. (b) Saturation artifacts from applying truncation to a poorly conditioned model.

Figure 4.

Moreover, the authors of [24] pointed out that the noise z has no noticeable influence on the result; they therefore proposed to provide noise in the form of dropout, applied during both training and testing, instead of samples drawn from a random distribution. In this model, the structure of G is based on U-Net, whose skip connections join each encoder layer to the corresponding decoder layer in order to share low-level information such as the edges of objects. In [24] the authors also proposed PatchGAN. Rather than classifying the whole image, the proposed discriminator classifies each N × N patch of the image and averages the patch scores to obtain the final score of the image. The experiments show that, to capture high-frequency details, it is sufficient to restrict the discriminator to local patches.
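As a rough illustration of how these pieces fit together, the following Python sketch averages per-patch discriminator probabilities into an image-level score, as described for PatchGAN, and combines the generator-side adversarial term with a weighted L1 term. The function names, the use of a mean rather than a sum for the L1 term, the weight value, and the toy numbers are assumptions made for illustration; this is not the implementation of [24].

```python
import math

def image_score(patch_probs):
    """Average the per-patch 'real' probabilities into one image-level score."""
    flat = [p for row in patch_probs for p in row]
    return sum(flat) / len(flat)

def l1_term(target, generated):
    """Mean absolute difference between target and generated pixel values."""
    return sum(abs(t - g) for t, g in zip(target, generated)) / len(target)

def generator_objective(patch_probs, target, generated, lam):
    """Generator-side value of the combined objective: log(1 - D) + lam * L1.

    Assumes the averaged score is strictly below 1, so the logarithm is defined.
    """
    d = image_score(patch_probs)
    return math.log(1.0 - d) + lam * l1_term(target, generated)

# Toy example: a 2x2 grid of patch probabilities and three flattened pixels.
patch_probs = [[0.2, 0.4], [0.6, 0.8]]
target = [0.0, 1.0, 2.0]
generated = [0.0, 0.0, 0.0]
loss = generator_objective(patch_probs, target, generated, lam=10.0)
print(loss)
```

A larger weight on the L1 term pulls the output toward the ground truth, while the adversarial term rewards outputs whose patches look real to the discriminator.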


Yoo et al. proposed an algorithm for supervised image-to-image translation with a secondary discriminator $D_{pair}$ that evaluates whether or not a pair of images from different domains is related. The loss of $D_{pair}$ is calculated as follows:

$$\ell_{pair} = -t \log[D_{pair}(X_s, X)] + (t - 1) \log[1 - D_{pair}(X_s, X)], \quad \text{s.t. } t = \begin{cases} 1 & \text{if } X = X_t \\ 0 & \text{if } X = \hat{X}_t \\ 0 & \text{if } X = X_t^{-} \end{cases} \tag{4}$$

where $X_s$ denotes the input image from the source domain, $X_t$ its ground-truth image in the target domain, and $X_t^{-}$ an irrelevant image in the target domain. The generator in the proposed model transfers $X_s$ into a single image $\hat{X}_t$ in the associated domain.

The authors proposed an efficient pyramid adversarial network that generates synthetic labels based on target domains for road segmentation in remote sensing images. Zareapoor et al. proposed a semi-supervised adversarial network for dataset balancing in mechanical devices. The authors integrate multi-instance learning into adversarial networks for human pose estimation. As the results show, the proposed model is both accurate and fast. Shamsolmoali et al. proposed a capsule adversarial network based on minority class augmentation to handle imbalanced class problems.
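Returning to the pair-discriminator loss of Eq. (4), the sketch below evaluates it for a single image pair. It assumes the loss has the binary cross-entropy form $-t \log D - (1 - t) \log(1 - D)$, with t = 1 only for the ground-truth pair; the function name `pair_loss` and the clamping constant are hypothetical and are not taken from Yoo et al.

```python
import math

def pair_loss(d_pair_output, t):
    """Loss of the pair discriminator for one (source, candidate) pair.

    d_pair_output: probability in (0, 1) that the pair is related.
    t: 1 for the ground-truth target, 0 for a generated or irrelevant image.
    """
    eps = 1e-12  # clamp to keep the logarithms finite (an added safeguard)
    d = min(max(d_pair_output, eps), 1.0 - eps)
    return -t * math.log(d) + (t - 1) * math.log(1.0 - d)

# A related pair should be scored high, an unrelated pair low.
loss_true_pair_confident = pair_loss(0.9, t=1)
loss_true_pair_unsure = pair_loss(0.1, t=1)
loss_fake_pair_confident = pair_loss(0.1, t=0)
loss_fake_pair_unsure = pair_loss(0.9, t=0)
```

Because generated and irrelevant images both receive t = 0, the discriminator is pushed to accept only pairs whose target is the actual ground truth for the given source image.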

In another work, the authors proposed a general learning framework that assigns the generated samples to a distribution over a set of labels instead of a single label. The effectiveness of their model is demonstrated through a set of experiments. Zhang et al. proposed the DRCW-ASEG method to generate synthetic examples for multi-class imbalanced problems. The authors showed that their strategy improves classification accuracy.

There is no noise input in the generator of pix2pix. A novelty of pix2pix is that its generator learns a mapping from an observed image y to an output image G(y), for example, from a grayscale image to a color image. As a follow-up to pix2pix, pix2pixHD [61] used cGANs and a feature matching loss for high-resolution image synthesis and semantic manipulation. With the discriminators, the learning problem becomes a multi-task learning problem. Chrysos et al. [8] proposed robust cGANs. Thekumparampil et al. [60] discussed the robustness of conditional GANs to noisy labels. Conditional CycleGAN [39] uses cGANs with cyclic consistency. Mode seeking GANs (MSGANs) [40] propose a simple yet effective regularization term to address the mode collapse issue for cGANs. GANs are also utilized to achieve image composition [33, 3, 69, 65]. Based on cGANs, we can generate samples conditioned on class labels [45, 44] or text [49, 22, 71]. In [71, 70], text to photo-realistic image synthesis is conducted with stacked generative adversarial networks (SGAN) [23]. cGANs have been used for convolutional face generation [15], face aging [1], multi-modal image translation [59, 75, 67], panoramic image generation [14, 54], exemplar-based image synthesis [75, 72], synthesizing outdoor images having specific scenery attributes [25], natural image description [9], and scene manipulation [62]. Most cGAN-based methods [11, 47, 51, 13, 55] feed the conditional information y into the discriminator by simply concatenating (embedded) y to the input or to the feature vector at some middle layer. cGANs with a projection discriminator [41] instead adopt an inner product between the condition vector y and the feature vector.

Two-domain I2I can solve many problems in computer vision, computer graphics and image processing, such as image style transfer (f.) [76, 31] and bounding-box and keypoint tasks [50, 68], which can be used in photo editor apps to improve user experience; semantic segmentation (c.) [46, 78], which benefits autonomous driving; image colorization (d.) [56, 32]; and domain adaptation [42, 6, 37, 66]. If low-resolution images are taken as the source domain and high-resolution images as the target domain, we can naturally achieve image super-resolution through I2I (e.) [64, 74].


2.1.1 Multimodal Outputs 


As shown in Fig. 1, multimodal I2I translates the input image from one domain to a distribution of potential outputs in the target domain while remaining faithful to the input.

Multimodal translation benefits from solutions to the mode collapse problem [17, 2, 19], in which the generator tends to map different input samples to the same output. Thus, many multimodal I2I methods [77, 4] focus on solving the mode collapse problem, which naturally leads to diverse outputs. BicycleGAN [77] became the first supervised multimodal I2I work by combining cVAE-GAN [21, 29, 30] and cLR-GAN [7, 12, 13] to systematically study a family of solutions to the mode collapse problem and generate diverse and realistic outputs.

Similarly, Bansal et al. [4] proposed PixelNN to achieve 
multimodal and controllable translated results in I2I. They 
proposed a nearest-neighbor (NN) approach combining pix- 
elwise matching to translate the incomplete, conditioned in- 
put to multiple outputs and allow a user to control the trans- 
lation through on-the-fly editing of the exemplar set. 

Another solution for producing diverse outputs is to use a disentangled representation [7, 20, 26, 10], which aims to break down, or disentangle, each feature into narrowly defined variables and encode them as separate dimensions. When combining it with I2I, researchers disentangle the representation of the source and target domains into two parts: domain-invariant content features, which are preserved during the translation, and domain-specific style features, which are changed during the translation. In other words, I2I aims to transfer images from the source domain to the target domain by preserving content while replacing style. Therefore, one can achieve multimodal outputs by randomly sampling the style features, which are often regularized to follow a prior Gaussian distribution N(0, 1). Gonzalez-Garcia et al. [16] disentangled the representation of two domains into three parts: a shared part containing information common to both domains, and two exclusive parts that represent only those factors of variation that are particular to each domain. In addition to bi-directional multimodal translation and retrieval of similar images across domains, their model can also perform domain-specific transfer and interpolation across the two domains.


3. Conclusion 


We find that taking a model trained with z ~ N(0, I) and sampling z from a truncated normal improves both IS and FID. This is the truncation trick: truncating a z vector by resampling the values whose magnitude exceeds a chosen threshold. It leads to better-quality images at the cost of overall sample variety; the smaller the threshold, the smaller the sample variety. The authors notice that some of their larger models do not benefit from the truncation trick. Therefore, they introduce Orthogonal Regularization, $R_\beta(W) = \beta \lVert W^\top W \odot (\mathbf{1} - I) \rVert_F^2$, where W is a weight matrix, $\mathbf{1}$ is a matrix with all elements set to 1, and β is a hyperparameter set to 1e-4; with it, 60% of the larger models became amenable to truncation.

So, this wraps up our discussion of GauGAN’s architecture and its objective functions. In the next part, we talk about how GauGAN is trained and how it fares compared to its rival algorithms, especially its predecessor Pix2PixHD. Till then, you can check out the GauGAN web demo, which allows you to create random landscapes.

We see that the noise vector z is first split into equal-size chunks. The very first chunk (zs[0]) is taken as input, and the remaining chunks are concatenated with the class-conditional vector y. After that, we iterate over the ResBlocks (self.blocks) together with the concatenated vectors and pass each vector to its block. The final output is obtained by passing through batchnorm-relu-conv and tanh. Looks pretty simple, right? Now let’s see what happens inside the BatchNorm blocks. We see that the concatenated vector y is passed into self.gain and self.bias, which are just Linear layers. So, vector y is linearly projected to produce per-sample gains and biases for the BatchNorm layers of the block. The bias projections are zero-centered, while the gain projections are centered at 1; therefore, we add 1 after we apply self.gain. Finally, after we normalize the input x, we multiply it by the computed gain and add the bias.

Some Last Words. I hope I helped someone understand the concepts of BigGAN better. Anyway, my articles are just meant to introduce you to the concepts. You can always read the paper and, of course, get more details from it. I encourage you to study the paper on your own. This article provides a good amount of information so that the paper seems a little bit easier.
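Returning to the truncation trick described at the start of this section, a minimal sketch of the resampling step might look as follows. It uses only the Python standard library; the function name `truncated_normal_vector`, the vector size, and the threshold used in the example are illustrative assumptions rather than values taken from the BigGAN code.

```python
import random

def truncated_normal_vector(dim, threshold, rng):
    """Sample z ~ N(0, I), resampling any entry whose magnitude exceeds threshold."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    z = []
    for _ in range(dim):
        value = rng.gauss(0.0, 1.0)
        # Keep drawing until the value falls inside [-threshold, threshold].
        while abs(value) > threshold:
            value = rng.gauss(0.0, 1.0)
        z.append(value)
    return z

rng = random.Random(0)  # fixed seed so the example is repeatable
z_truncated = truncated_normal_vector(dim=128, threshold=0.5, rng=rng)
```

Lowering the threshold concentrates z near the origin, which is the trade of sample variety for per-sample quality that the text describes; very small thresholds also make the resampling loop slower, because most draws are rejected.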